pacman::p_load(olsrr, sf, sfdep, GWmodel, tmap, tidyverse, gtsummary, SpatialML,rsample,Metrics, jsonlite,httr,rvest,sp)Take_Home_Ex03_Updated
1. Overview
Housing is an essential component of household wealth worldwide. Buying a housing has always been a major investment for most people. The price of housing is affected by many factors. Some of them are global in nature such as the general economy of a country or inflation rate. Others can be more specific to the properties themselves. These factors can be further divided to structural and locational factors. Structural factors are variables related to the property themselves such as the size, fitting, and tenure of the property. Locational factors are variables related to the neighbourhood of the properties such as proximity to childcare centre, public transport service and shopping centre.

Conventional, housing resale prices predictive models were built by using Ordinary Least Square (OLS) method. However, this method failed to take into consideration that spatial autocorrelation and spatial heterogeneity exist in geographic data sets such as housing transactions. With the existence of spatial autocorrelation, the OLS estimation of predictive housing resale pricing models could lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, Geographical Weighted Models were introduced for calibrating predictive model for housing resale prices.
1.1 The Task
In this take-home exercise, we are tasked to predict HDB resale prices at the sub-market level (i.e. HDB 3-room, HDB 4-room and HDB 5-room) for the month of January and February 2023 in Singapore. The predictive models must be built by using by using conventional OLS method and GWR methods. We are also required to compare the performance of the conventional OLS method versus the geographical weighted methods.
2. Data Acquisition Source
For the purpose of this take-home exercise, HDB Resale Flat Prices provided by Data.gov.sg should be used as the core data set. The study should focus on either three-room, four-room or five-room flat and transaction period should be from 1st January 2021 to 31st December 2022. The test data should be January and February 2023 resale prices.
In addition, we will also include other locational factors such as proximity of HDB to eldercare services, and shopping malls etc for considerations.
Data Summary Table
| Type | Name | Format | Source |
|---|---|---|---|
| Aspatial | HDB resale flat prices | .csv | data.gov.sg |
| Geospatial | Master plan 2014 subzone web boundary | .shp | data.gov.sg |
| Geospatial (Locational factor) | Elder care services | .shp | data.gov.sg |
| Geospatial (Locational factor) | Hawker centres | .geojson | data.gov.sg |
| Geospatial (Locational factor) | MRT stations | .shp | |
| Geospatial (Locational factor) | Supermarkets | .geojson | data.gov.sg |
| Geospatial (Locational factor) | Student care services | .geojson | data.gov.sg |
| Geospatial (Locational factor) | Shopping Malls | .csv | https://github.com/ValaryLim/Mall-Coordinates-Web-Scraper |
| Geospatial (Locational factor) | Bus Stops | .shp | data.gov.sg |
| Geospatial (Locational factor) | Dengue clusters | .geojson | data.gov.sg |
| Geospatial (Locational factor) | Parks | .kml | data.gov.sg |
| Geospatial (Locational factor) | Kindergartens | .shp | data.gov.sg |
3. Getting Started
3.1 Installing and Loading the R packages
The R packages installed that we will be using for take-home assignment 3 are:
sf: used for importing, managing, and processing geospatial data
tmap: used for creating thematic maps, such as choropleth and bubble maps
tidyverse: a collection of packages for data science tasks
sfdep: An interface for ‘spdep’ to integrate with ‘sf’ objects and the ‘tidyverse’
plotly: used for creating interactive and dynamic visualisations in R
olsrr: designed for use with ordinary least squares (OLS) regression
GWmodel: (TO CONTINUEEEE)
4. Data Wrangling: Geospatial Data & Aspatial Data
4.1 Importing Aspatial Data
resale <- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")From the results above, we can tell that:
The dataset contains 11 columns with 148,000 rows in total.
The timeframe of the dataset is from 2017 January to 2023 February up to date (from the review of dataset).
The columns that are present in the data are: month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease, resale_price (from the review of dataset).
In this take-home assignment, I selected HDB 4-room flat resale prices to analyse during the transaction period from 1st January 2021 to 31st December 2022. Therefore, we will need to filter and only extract data during this period of time frame.
4.1.1 Filter Resale Data
As mentioned in the previous section of 4.1, we are only interested in (1) HDB 4-room flat resale prices during the period of (2) 1st January 2021 to 31st December 2022. Let’s filter them!
rs_subset <- filter(resale,flat_type == "4 ROOM") %>%
filter(month >= "2021-01" & month <= "2022-12")From the results above, we can tell that:
We have successfully filtered our data based on earlier chosen HDB model flat and transaction period!
From January 2021 to December 2022, there are 23,657 transactions for 4-room flat in Singapore.
Additionally, we will need to extract another data set for our testing model which is between the period of January 2023 to February 2023.
rs_subset_test <- filter(resale,flat_type == "4 ROOM") %>%
filter(month >= "2023-01" & month <= "2023-02")From the results above, we can tell that:
We have successfully filtered our test data as well!
From January 2023 to February 2023, there are 1848 transactions for 4-room flat in Singapore.
4.1.2 Transform Resale Data
After we have extracted the rows of transactions we are interested in, we will then proceed to use mutate function of dplyr package to create new variables (columns) in a data frame by applying some transformations to the existing columns.
What we will need to do is:
address: concatenation of the block and street_name columns using paste() function of base R package.
remaining_lease_yr & remaining_lease_mnth: Split the year and months part of the remaining_lease respectively using str_sub() function of stringr package then converting the character to integer using as.integer() function of base R package.
After performing mutate function, we will store the new data in rs_transform.
rs_transform <- rs_subset %>%
mutate(rs_subset, address = paste(block,street_name)) %>%
mutate(rs_subset, remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2))) %>%
mutate(rs_subset, remaining_lease_mnth = as.integer(str_sub(remaining_lease, 9, 11)))After we have successfully added the three variables (address, remaining_lease_yr, and remaining_lease_mnth) into a new data named rs_transform, we will see some NA values in the remaining_lease_mnth column. Therefore, we will need to replace those with a value of 0 using is.na() function of base R package.
rs_transform$remaining_lease_mnth[is.na(rs_transform$remaining_lease_mnth)] <- 0
rs_transform# A tibble: 23,657 × 14
month town flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 2021-01 ANG MO… 4 ROOM 547 ANG MO… 04 TO … 92 New Ge… 1981 59 yea…
2 2021-01 ANG MO… 4 ROOM 414 ANG MO… 01 TO … 92 New Ge… 1979 57 yea…
3 2021-01 ANG MO… 4 ROOM 509 ANG MO… 01 TO … 91 New Ge… 1980 58 yea…
4 2021-01 ANG MO… 4 ROOM 467 ANG MO… 07 TO … 92 New Ge… 1979 57 yea…
5 2021-01 ANG MO… 4 ROOM 571 ANG MO… 07 TO … 92 New Ge… 1979 57 yea…
6 2021-01 ANG MO… 4 ROOM 134 ANG MO… 07 TO … 98 New Ge… 1978 56 yea…
7 2021-01 ANG MO… 4 ROOM 204 ANG MO… 07 TO … 92 New Ge… 1977 55 yea…
8 2021-01 ANG MO… 4 ROOM 429 ANG MO… 04 TO … 92 New Ge… 1978 56 yea…
9 2021-01 ANG MO… 4 ROOM 413 ANG MO… 10 TO … 92 New Ge… 1979 57 yea…
10 2021-01 ANG MO… 4 ROOM 415 ANG MO… 07 TO … 92 New Ge… 1979 57 yea…
# … with 23,647 more rows, 4 more variables: resale_price <dbl>, address <chr>,
# remaining_lease_yr <int>, remaining_lease_mnth <dbl>, and abbreviated
# variable names ¹flat_type, ²street_name, ³storey_range, ⁴floor_area_sqm,
# ⁵flat_model, ⁶lease_commence_date, ⁷remaining_lease
Now, as we scroll to the remaining_lease_mnth column, we noticed all initial “NA” values have been replaced by 0!
Next, we do not want to segregate the remaining lease in years and months columns. Instead, we could convert the remaining_lease_yr to months unit and create a new column call total_remaining_lease for easier analysis later using mutate function of dplyr package which contains the summation of the remaining_lease_yr and remaining_lease_mnth using rowSum() function of base R package. Here is how we do it!
# Multiply remaining_lease_yr column in months unit
rs_transform$remaining_lease_yr <- rs_transform$remaining_lease_yr * 12
# Create a new column: total_remaining_lease to contain the summation of yr and mnth
rs_transform <- rs_transform %>%
mutate(rs_transform, total_remaining_lease = rowSums(rs_transform[, c("remaining_lease_yr", "remaining_lease_mnth")])) %>%
select(month, town, address, block, street_name, flat_type, storey_range, floor_area_sqm, flat_model,
lease_commence_date, total_remaining_lease, resale_price)
# Display head of data
head(rs_transform)# A tibble: 6 × 12
month town address block stree…¹ flat_…² store…³ floor…⁴ flat_…⁵ lease…⁶
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 2021-01 ANG MO … 547 AN… 547 ANG MO… 4 ROOM 04 TO … 92 New Ge… 1981
2 2021-01 ANG MO … 414 AN… 414 ANG MO… 4 ROOM 01 TO … 92 New Ge… 1979
3 2021-01 ANG MO … 509 AN… 509 ANG MO… 4 ROOM 01 TO … 91 New Ge… 1980
4 2021-01 ANG MO … 467 AN… 467 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
5 2021-01 ANG MO … 571 AN… 571 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
6 2021-01 ANG MO … 134 AN… 134 ANG MO… 4 ROOM 07 TO … 98 New Ge… 1978
# … with 2 more variables: total_remaining_lease <dbl>, resale_price <dbl>, and
# abbreviated variable names ¹street_name, ²flat_type, ³storey_range,
# ⁴floor_area_sqm, ⁵flat_model, ⁶lease_commence_date
Upon inspection of the rs_transform, we now only left with one column: total_remaining_lease that contains all the remaining lease in months!
4.1.3 Retrieve Postal Codes and Coordinates of Addresses
In this section, we will focus on retrieving the relevant data like postal codes and coordinates of the address which is required to get the proximity to locational factors in the later parts.
Here are the steps to add its longitude and latitude features with OneMapSG API!
Step 1: Create a list storing unique addresses
add_list <- sort(unique(rs_transform$address))Step 2: Create function to retrieve coordinates from OneMapSG API
get_coords <- function(add_list){
# Create a data frame to store all retrieved coordinates
postal_coords <- data.frame()
for (i in add_list){
r <- GET('https://developers.onemap.sg/commonapi/search?',
query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
# Send a GET request to OneMap API with address as searchVal,
# returnGeom as 'Y' to retrieve the coordinates, and getAddrDetails as 'Y' to retrieve the postal code
data <- fromJSON(rawToChar(r$content))
found <- data$found
res <- data$results
# Extract the 'found' and 'results' fields from the API reponses
# Create a new data frame for each address
new_row <- data.frame()
# If single result, append
if (found == 1){
postal <- res$POSTAL
lat <- res$LATITUDE
lng <- res$LONGITUDE
new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
res_sub <- res[res$POSTAL != "NIL", ]
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
}
else{
top1 <- head(res_sub, n = 1)
postal <- top1$POSTAL
lat <- top1$LATITUDE
lng <- top1$LONGITUDE
new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
}
}
else {
new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
}
# Add the row
postal_coords <- rbind(postal_coords, new_row)
}
return(postal_coords)
}Step 3: Call get_coords function to retrieve resale coordinates
coords <- get_coords(add_list)4.1.4 Combine Resale and Coordinates Data
After we have done retrieving the location coordinates of all the resale HDBs, we need to now combine our resale data (rs_transform) earlier with the coordinates data (coords) using left_join() function.
rs_coords <- left_join(rs_transform, coords, by = c('address' = 'address'))Great! We have successfully joined the two data sets and now let’s write the file to our rds folder!
rs_coords_rds <- write_rds(rs_coords, "data/rds/rs_coords.rds")Now, let’s read rs_coords RDS file:
rs_coords <- read_rds("data/rds/rs_coords.rds")4.1.5 Assign and Transform CRS and Check
The coordinate columns (latitude, longitude) are currently in decimal degrees, the projected CRS will be WGS84. We will need to convert it into a spatial data frame with projected coordinates of 3414.
rs_coords_sf <- st_as_sf(rs_coords,
coords = c("longitude",
"latitude"),
crs=4326) %>%
st_transform(crs = 3414)st_crs(rs_coords_sf)Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
4.1.5.1 Check for Invalid Geometries
length(which(st_is_valid(rs_coords_sf) == FALSE))[1] 0
We have no invalid geometries! Now, let’s plot hdb resale points
tmap_mode("view")
tm_shape(rs_coords_sf)+
tm_dots(col="red", size = 0.02)